PBMC Vignette

Welcome to Amethyst! The goal of Amethyst is to facilitate comprehensive analysis tools for single-cell methylation data. If you use this package or code on our Github for your work, please cite our manuscript.

PACKAGE INSTALLATION

If you are new to R, you may need to install some of the dependencies:

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")

library("BiocManager")
BiocManager::install(c("caret", "devtools", "data.table", "dplyr", "furrr", "future", "future.apply",
  "ggplot2", "grDevices", "gridExtra", "igraph", "irlba", "janitor", "Matrix", "methods", "pheatmap",
  "plotly", "plyr", "purrr", "randomForest", "rhdf5", "rtracklayer", "Rtsne", "scales", "stats", "stringr", 
  "tibble", "tidyr", "umap", "utils"))

Next, install a few additional dependencies found on Github, including amethyst itself.

devtools::install_github("JinmiaoChenLab/Rphenograph")
devtools::install_github("KrishnaswamyLab/MAGIC/Rmagic")
devtools::install_github("TomKellyGenetics/leiden")
devtools::install_github("lrylaarsdam/amethyst")

Now load libraries into R:

library(amethyst)
library(data.table)
library(ggplot2)
library(dplyr)
library(tibble)
library(tidyr)
library(plyr)
library(future)
library(furrr)
library(purrr)
library(cowplot)
library(leiden)
library(pheatmap)

LOADING PRACTICE DATA

First, download the practice data. This vignette comes with site-level CpG methylation information from 50 human banked PBMCs. Download the h5 file and associated metadata with the following commands.

Note: By default, data will download to the “~/Downloads/” folder. Change if a different directory is desired.

download.file("https://adeylabopen.s3.us-west-2.amazonaws.com/amethyst/pbmc_vignette.h5", "~/Downloads/pbmc_vignette.h5", method = "curl") # Contains site-level methylation information for each cell
download.file("https://raw.githubusercontent.com/lrylaarsdam/amethyst/main/vignettes/pbmc_vignette/pbmc_vignette_cellInfo.txt", "~/Downloads/pbmc_vignette_cellInfo.txt") # Summary QC statistics for each cell
download.file("https://raw.githubusercontent.com/lrylaarsdam/amethyst/main/vignettes/pbmc_vignette/pbmc_vignette.annot", "~/Downloads/pbmc_vignette.annot") # Simulated batch metadata

Here is the structure of the h5 file:

ASSEMBLING THE AMETHYST OBJECT

Now construct an Amethyst object. This object is a convenient method for storing all our project information in one place. We will add paths to the h5 file, metadata, aggregated methylation values, results, and more. Each type of data goes in specific “slots” so Amethyst knows where to retrieve it. Generate an empty object with the createObject command.

obj <- createObject()

Premethyst-specific assembly instructions

If you used the Premethyst pipeline for initial processing, Amethyst has helper functions to add useful metadata about each cell from the intermediate output metric files. This includes quality control metrics contained in the .cellInfo.txt file or .annot files.

obj <- addCellInfo(obj, file = "~/Downloads/pbmc_vignette_cellInfo.txt")
obj <- addAnnot(obj, file = "~/Downloads/pbmc_vignette.annot", name = "batch") 

Next, we need to specify the location of the h5 file containing site-level methylation data for each barcode. In this case, every barcode belongs to the same h5 file, but an unlimited number of h5 files can be used in the same object.

Note: If you are combining experiments with overlapping barcodes, please see our vignette for addressing this. The h5paths slot is set up differently. Also, please note we have changed the heading from ‘paths’ (v0.0.0.9000) to ‘path’ (v1.0.0) to be more consistent, given the addition of the barcode and prefix columns.

obj@h5paths <- data.frame(barcode = rownames(obj@metadata), path = rep("~/Downloads/pbmc_vignette.h5", length(rownames(obj@metadata))))
head(obj@h5paths)
##                        barcode                         path
## 1 ACGCGACGGCACGAGAATCACTGTCATG ~/Downloads/pbmc_vignette.h5
## 2 ACGCGACGGCTATACCGAAGCGCCTATA ~/Downloads/pbmc_vignette.h5
## 3 ACTTCTGCCAACGCTTCGTCAGTACTCC ~/Downloads/pbmc_vignette.h5
## 4 ACTTCTGCCAACGCTTCGTCGATGATCC ~/Downloads/pbmc_vignette.h5
## 5 ACTTCTGCCAAGATAGAGGTCGTCTGCG ~/Downloads/pbmc_vignette.h5
## 6 ACTTCTGCCAATAATCGCGGCTGATGCT ~/Downloads/pbmc_vignette.h5

ScaleMethyl-specific assembly instructions

Amethyst has a helper loading function for output of the ScaleMethyl pipeline. This will generate an amethyst object, add metadata, and load the specified matrices.

Note: If you are combining experiments with overlapping barcodes, please see our vignette for addressing this. The h5paths slot is set up differently. Also, please note we have changed the heading from ‘paths’ (v0.0.0.9000) to ‘path’ (v1.0.0) to be more consistent, given the addition of the barcode and prefix columns.

# This is just for real data illustration purposes - do not run if you are processing vignette data.
obj <- createScaleObject(directory = "~/Downloads/ScaleMethyl.out/samples", genomeMatrices = list("CG.score", "CH"))

Instructions if using neither Premethyst nor ScaleMethyl

Any pre-processing platform can be used. The important thing is that the data is in the expected structure. The diagram below illustrates each slot and its respective data contents.

Filtering

It can be helpful to filter cells with outlying coverage values right away so downstream functions don’t perform calculations for cells that will not be used. This can easily be done with dplyr logic. First, view the coverage distribution, then filter as necessary.

Note: Vignette data has been pre-filtered. We recommend cells have a minimum of 1M cytosines covered, if possible. Filter by any additional metrics in your metadata as needed.

ggplot(obj@metadata, aes(x = cov)) + geom_histogram(bins = 10) + theme_classic()

obj@metadata <- obj@metadata |> dplyr::filter(cov > 100000 & cov < 40000000)

CLUSTERING

The next step is to cluster cells, which we typically do based on methylation values over fixed genomic windows. Depending on the size and depth of your dataset, calculating methylation levels over features can be a computationally intensive step. There are three ways to do this (only do one):

  1. Calculate genomic windows in R with makeWindows
  2. Use Facet to pre-compute methylation levels over genomic regions and store in the h5 file
  3. Use a different method - like MethSCAn - and add results to the Amethyst object

Note: Clustering methods for single-cell methylation data, as a newer modality, will require more hands-on interpretation of results. Please thoroughly check that your results are not driven by technical issues (like coverage bias) and expect that this step will take some optimization.

1. Calculate methylation levels over genomic windows in R using Amethyst functions

The most straightforward method is to calculate methylation levels over features with Amethyst in R. First, perform an initial indexing step, which determines the locations corresponding to each chromosome in every h5 file. This reduces the computational load by only calculating across one at a time.

Note: If you are using non-conventional chromosomes, please provide a whitelist (see function parameters).
Note: You may have to copy/paste any code reading the h5 file (e.g., makeWindows) directly into the console instead of running the chunk, if using the .Rmd template.

obj@index[["chr_cg"]] <- indexChr(obj, type = "CG", threads = 1) 
obj@genomeMatrices[["cg_100k_score"]] <- makeWindows(obj,
                                                     stepsize = 100000, 
                                                     type = "CG", 
                                                     metric = "score", 
                                                     threads = 1, 
                                                     index = "chr_cg", 
                                                     nmin = 2) 

2. Calculate methylation levels over genomic windows using Facet

makeWindows can be memory-intensive. If you have very large datasets, it may be preferable to apply Facet, a Python-based helper script to compute windows in a more efficient manner. Results are stored in the h5 file and loaded as needed. For installation instructions and examples, see Facet documentation.

Note: Skip to the “clustering continued” section if you used makeWindows to calculate 100kb window methylation score.

# example of how to compute aggregated windows with facet
facet agg -u 100000 brain_vignette_facet.h5

If pre-computed windows are stored in the h5 file, load to the genomeMatrices slot with loadWindows. An indexing step is not needed for Facet.

obj@genomeMatrices[["cg_100k_score_facet"]] <- loadWindows(obj, name = "100000", nmin = 2, metric = "score")

3. Integrate an alternative method

There are many methods out there for clustering. One may wish to implement an alternative strategy, like clustering with MethSCAn variably methylated regions. This worked quite well for our small, high-coverage dataset. Results from other methods can easily be dropped in to an Amethyst object and analysis can proceed as normal. Please see our vignette for example alternative clustering methods. In the end, all you have to do is assign the output to a new reductions result:

# theoretical example of 2D umap projection from methscan vmrs - do not run
obj@reductions[["umap_methscan_vmrs"]] <- methscan_result
colnames(obj@reductions[["umap_methscan_vmrs"]]) <- c("dim_x", "dim_y")

Clustering with Amethyst or Facet - continued

You may want to remove windows where most values are NA. In this case, since the vignette data is high coverage and the genomic windows are large, I am going to filter for at least 90% of the cells have values in that window. The appropriate threshold will highly depend on how big the windows are and how many cells you have.

Note: Please adjust this threshold in a dataset-specific manner. If you use short windows, a 90% threshold is way too high.

# Check that you won't be filtering too many rows first
nrow(obj@genomeMatrices[["cg_100k_score"]][rowSums(!is.na(obj@genomeMatrices[["cg_100k_score"]])) >= nrow(obj@metadata)*.9, ])
## [1] 28422
# Proceed if the number passing is reasonable (this is somewhat subjective based on your dataset and windowing strategy)
obj@genomeMatrices[["cg_100k_score"]] <- obj@genomeMatrices[["cg_100k_score"]][rowSums(!is.na(obj@genomeMatrices[["cg_100k_score"]])) >= nrow(obj@metadata)*.9, ]

Next, perform dimensionality reduction. If you are unsure how many dimensions to use, the dimEstimate function can estimate the number needed to explain the desired variance threshold.

Note: In this example, the number of requested output dimensions is low because pbmc_vignette.h5 has 50 cells. Adjust according to your data. Skip if it’s not helpful.

dimEstimate(obj, genomeMatrices = c("cg_100k_score"), dims = c(10), threshold = 0.95)
## cg_100k_score 
##             7

As suggested, we will reduce the data from “cg_100k_score” into seven dimensions using the irlba package, which performs fast truncated singular value decomposition. Feel free to implement alternative methods if desired.

set.seed(111)
# name the result whatever you want. I like descriptive names, at the cost of length.
obj@reductions[["irlba_cg_100k_score"]] <- runIrlba(obj, genomeMatrices = c("cg_100k_score"), dims = c(7), replaceNA = c(0))
# Optional; helps reduce coverage bias in clustering. Run both ways and see if results are improved.
obj@reductions[["irlba_cg_100k_score_gam"]] <- regressCovBias(obj, reduction = "irlba_cg_100k_score", method = "gam") 

Now determine cluster membership with runCluster. Either a Louvain-based or Leiden-based strategy can be used. Assign the output column name, which will be added to the metadata. This update was implemented so one can save multiple iterations within the same object.

Note: In this example, k is low because pbmc_vignette.h5 has 50 cells. Increase to 50 or so for larger datasets, depending on number of cells and expected heterogeneity.

set.seed(111)
# run clustering with louvain-based method
obj <- runCluster(obj, k = 10, reduction = "irlba_cg_100k_score_gam", method = "louvain", colname = "louvain_cluster_100k") 
## Run Rphenograph starts:
##   -Input data of 50 rows and 7 columns
##   -k is set to 10
##   Finding nearest neighbors...DONE ~ 0.001 s
##   Compute jaccard coefficient between nearest-neighbor sets...DONE ~ 0.002 s
##   Build undirected graph from the weighted links...DONE ~ 0.009 s
##   Run louvain clustering on the graph ...DONE ~ 0.002 s
## Run Rphenograph DONE, totally takes 0.0139999999999958s.
##   Return a community class
##   -Modularity value: 0.5277906 
##   -Number of clusters: 4
# example: run leiden-based algorithm and store within the same object
obj <- runCluster(obj, k = 10, reduction = "irlba_cg_100k_score_gam", method = "leiden", colname = "leiden_cluster_100k") 

# look at how results are stored
head(obj@metadata)
##                                   cov  cg_cov mcg_pct   ch_cov mch_pct batch
## ACGCGACGGCACGAGAATCACTGTCATG 27258073  923300   82.21 21341679    0.54     2
## ACGCGACGGCTATACCGAAGCGCCTATA 29188802  918053   81.16 22528736    0.53     2
## ACTTCTGCCAACGCTTCGTCAGTACTCC 25963119 1039143   71.63 21429728    0.45     2
## ACTTCTGCCAACGCTTCGTCGATGATCC 29345922 1144694   74.76 24104451    0.45     1
## ACTTCTGCCAAGATAGAGGTCGTCTGCG 27393943 1249755   68.49 23454887    0.54     1
## ACTTCTGCCAATAATCGCGGCTGATGCT 28318992 1264972   69.86 23883025    0.46     1
##                              louvain_cluster_100k leiden_cluster_100k
## ACGCGACGGCACGAGAATCACTGTCATG                    1                   4
## ACGCGACGGCTATACCGAAGCGCCTATA                    1                   3
## ACTTCTGCCAACGCTTCGTCAGTACTCC                    2                   2
## ACTTCTGCCAACGCTTCGTCGATGATCC                    2                   2
## ACTTCTGCCAAGATAGAGGTCGTCTGCG                    2                   2
## ACTTCTGCCAATAATCGCGGCTGATGCT                    2                   2

Next, project to 2D with runUmap and/or runTsne. Assign the output to a reductions slot. Like the runCluster update, this allows multiple projections to be stored within the same object.

Note: Neighbors and perplexity values are small because this dataset is 50 cells. Increase for larger datasets.

set.seed(111)
# name the result whatever you want. I like descriptive names, at the cost of length.
obj@reductions[["umap_irlba_cg_100k_score"]] <- runUmap(obj, neighbors = 5, dist = 0.05, 
                                                        method = "euclidean", reduction = "irlba_cg_100k_score_gam") 
obj@reductions[["tsne_irlba_cg_100k_score"]] <- runTsne(obj, perplexity = 5, method = "euclidean", 
                                                        theta = 0.2, reduction = "irlba_cg_100k_score_gam") 

Visualizing the results

First, plot the UMAP or TSNE coordinates of the cells with the color corresponding to cluster membership.

p1 <- dimFeature(obj, colorBy = louvain_cluster_100k, reduction = "umap_irlba_cg_100k_score", pointSize = .8) + ggtitle("UMAP CG 100k score")
p2 <- dimFeature(obj, colorBy = louvain_cluster_100k, reduction = "tsne_irlba_cg_100k_score", pointSize = .8) + ggtitle("TSNE CG 100k score")
plot_grid(p1, p2)

Re-cluster if needed

You might find that fixed genomic windows don’t give you good resolution of groups. Any feature set can be used for dimensionality reduction input. The makeWindows function can also calculate methylation levels over a bed file or genes (Note: Not recommended unless you are calculating %mCH.) Here is another clustering example using a set of pre-identified PBMC variably methylated regions (VMRs).

download.file("https://raw.githubusercontent.com/lrylaarsdam/amethyst/main/vignettes/pbmc_vignette/pbmc_highconfidence_dmrs.bed", "~/Downloads/pbmc_vmr.bed") 
obj@genomeMatrices[["pbmc_vmrs"]] <- makeWindows(obj, bed = "~/Downloads/pbmc_vmr.bed", type = "CG", metric = "percent", threads = 1, index = "chr_cg", nmin = 2)
obj@genomeMatrices[["pbmc_vmrs"]] <- obj@genomeMatrices[["pbmc_vmrs"]][rowSums(!is.na(obj@genomeMatrices[["pbmc_vmrs"]])) >= nrow(obj@metadata)*.2, ] 

Now re-run irlba with the DMR-based windows:

dimEstimate(obj, genomeMatrices = c("pbmc_vmrs"), dims = c(10), threshold = 0.90)
## pbmc_vmrs 
##         8
set.seed(111)
obj@reductions[["irlba_cg_vmr_pct"]] <- runIrlba(obj, genomeMatrices = c("pbmc_vmrs"), dims = c(8), replaceNA = c(0))

Now re-run clustering and UMAP:

set.seed(111)
obj <- runCluster(obj, k = 10, reduction = "irlba_cg_vmr_pct", method = "louvain", colname = "louvain_vmr_cluster") # consider increasing k_phenograph to 50 for larger datasets
## Run Rphenograph starts:
##   -Input data of 50 rows and 8 columns
##   -k is set to 10
##   Finding nearest neighbors...DONE ~ 0 s
##   Compute jaccard coefficient between nearest-neighbor sets...DONE ~ 0.002 s
##   Build undirected graph from the weighted links...DONE ~ 0.003 s
##   Run louvain clustering on the graph ...DONE ~ 0 s
## Run Rphenograph DONE, totally takes 0.00500000000000256s.
##   Return a community class
##   -Modularity value: 0.5383537 
##   -Number of clusters: 4
set.seed(123)
obj@reductions[["umap_irlba_cg_vmr_pct"]] <- runUmap(obj, neighbors = 5, dist = 0.05, method = "euclidean", reduction = "irlba_cg_vmr_pct") 
dimFeature(obj, colorBy = louvain_vmr_cluster, reduction = "umap_irlba_cg_vmr_pct", pointSize = .8)

In this context, VMRs worked much better than genomic windows for clustering, so we will continue with this dimensionality reduction.

Continue exploring clustering results

dimFeature uses ggplot logic, so you can easily modify plots as needed.For example:

dimFeature(obj, colorBy = louvain_vmr_cluster, reduction = "umap_irlba_cg_vmr_pct", pointSize = .8) + facet_wrap(vars(batch)) # Batch is simulated to illustrate function utility. Any column in the metadata will work.

Show the distribution of cluster membership between samples with sampleComp. Plots can be easily modified with ggplot command logic.

Note: Since there are so few cells, the variation in proportion between batches is not concerning.

sampleComp(obj, groupBy = "batch", colorBy = "louvain_vmr_cluster") 

If you want to make the UMAP/TSNE plots look nicer, Amethyst provides many built-in color palettes:

testPalette(output = "swatch", n = length(unique(obj@metadata$louvain_vmr_cluster)))

pal <- c("#F9AB60", "#E7576E", "#630661", "#B5DCA5") # makePalette(option = 7, n = 4) 
dimFeature(obj, colorBy = louvain_vmr_cluster, colors = pal, pointSize = .8, reduction = "umap_irlba_cg_vmr_pct")

dimFeature is useful for visualizing how different parameters in the cellInfo file are distributed throughout the UMAP:

p1 <- dimFeature(obj, reduction = "umap_irlba_cg_vmr_pct", colorBy = log(cov), pointSize = .8) + 
  scale_color_gradientn(colors = c("black", "turquoise", "gold", "red")) + ggtitle("Coverage distribution")
p2 <- dimFeature(obj, reduction = "umap_irlba_cg_vmr_pct", colorBy = mcg_pct, pointSize = .8) + 
  scale_color_gradientn(colors = c("black", "turquoise", "gold", "red")) + ggtitle("Global %mCG distribution")
plot_grid(p1, p2)

ANNOTATION

Now that we have clusters, the next step is determining their biological identity. This can be quite challenging for a novel modality like methylation. There are a couple ways we recommend going about this:

  1. mCG hypomethylation over regulatory elements of canonical marker genes
  2. Compare to a reference if possible

1. Use mCG patterns over canonical marker genes

If you have an in-depth knowledge of the system, one useful method is to look at mCG hypomethylation over canonical marker genes. The first step is to load an annotation file for the reference genome so Amethyst knows the coordinates for each gene.

obj@ref <- makeRef("hg38")

Next, calculate methylation levels in short genomic windows for each cluster. We recommend 500bp windows for real data, but 1000 are used here since the vignette dataset is only 50 cells.
Note: To avoid the hassle of recomputing these every time clusters are adjusted, one can also calculate sliding smoothed windows with Facet for each cell and store in the h5 file, then re-calculate/load more rapidly per group with loadSmoothedWindows. See ?loadSmoothedWindows for parameter specifications.

cluster1kbwindows <- calcSmoothedWindows(obj, 
                                         type = "CG", 
                                         threads = 1,
                                         step = 1000, # change to 500 for real data unless you have really low coverage
                                         smooth = 3,
                                         genome = "hg38",
                                         index = "chr_cg",
                                         groupBy = "louvain_vmr_cluster",
                                         returnSumMatrix = TRUE, # save sum matrix for DMR analysis
                                         returnPctMatrix = TRUE)
obj@tracks[["cg_cluster_tracks"]] <- cluster1kbwindows[["pct_matrix"]]

Now you can view methylation patterns over key marker genes:

Note: Blue is hypomethylated. Legend has been ommitted for plot clarity. See legend with “legend = T”.

heatMap(obj, 
        genes = c("SPI1", "CD2", "S100A8", "CD79A", "ELANE", "MPO", "MPEG1", "IRF8", "CD74", "CD3E", "CD3D", "KLRB1"), 
        track = "cg_cluster_tracks", 
        nrow = 4,
        legend = F)

If you want to simultaneously view another other information - like regulatory elements - these can be plotted as “tracks” below. In this example, we will use Encode candidate cis-regulatory elements (cCREs). The track structure should be a named list of data tables with chr, start, and end columns. First, download and split tracks into the expected structure.

# note this is for hg38 build only
library(rtracklayer)

download.file("https://hgdownload.soe.ucsc.edu/gbdb/hg38/encode3/ccre/encodeCcreCombined.bb", "~/Downloads/encodeCcreCombined.bb") 
ccre <- as.data.table(rtracklayer::import("~/Downloads/encodeCcreCombined.bb"))
setnames(ccre, "seqnames", "chr")
ccre_tracks <- split(ccre, ccre$ucscLabel)
ccre_tracks <- lapply(ccre_tracks, function(x) {x[,.(chr, start, end)]})

Now plot tracks under the heatMaps:

heatMap(obj, 
        genes = c("CD2", "IRF8", "CD74"), 
        track = "cg_cluster_tracks", 
        nrow = 1,
        legend = T,
        extraTracks = ccre_tracks)

As you can see from the heatMaps, promoters are sometimes universally hypomethylated or not at the predicted site. Methylation patterns over gene bodies are highly complex. Because of this, we prefer these tools to visualize over the whole gene body and surrounding context as opposed to aggregated metrics.

If you want to view other overlapping genes, use the “regions” parameter of heatMap:

heatMap(obj, 
        regions = getGeneCoords(obj@ref, gene = "S100A8"), # getGeneCoords is a shortcut to produce "chr1_153390032_153391073"
        trackOverhang = 10000,
        remove = "ENS", # optional; uses grep to remove genes that may be outside the question of interest
        track = "cg_cluster_tracks", 
        extraTracks = ccre_tracks)

In addition to heatMap, you can also view methylation levels with the histograM function. Please see function documentation for a full list of which parameters can be changed.

histograM(obj, 
          baseline = "mean", # can plot from either mean or 0; see documentation
          orientation = "cols",
          genes = "CD3D", 
          track = "cg_cluster_tracks",
          legend = F,
          extraTracks = ccre_tracks)
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

However, it can still be useful to look at aggregated metrics, especially when analyzing brain data. In this example we will calculate %mCG for predicted promoter regions. As previously mentioned, this is fraught with caveats, but we will use here for illustration purposes. It works well for some genes. In this example we will use all protein coding genes, but a subset can easily be used.

protein_coding <- unique(obj@ref |> dplyr::filter(type == "gene" & gene_type == "protein_coding" & seqid != "chrM") |> dplyr::pull(gene_name))
obj@genomeMatrices[["cg_promoters"]] <- makeWindows(obj, 
                                                     genes = protein_coding,
                                                     promoter = TRUE, 
                                                     type = "CG", 
                                                     metric = "percent", 
                                                     threads = 1, 
                                                     index = "chr_cg", 
                                                     nmin = 5) 

# subsetting to genes with values in at least 5 cells (10%) to reduce matrix size
obj@genomeMatrices[["cg_promoters"]] <- obj@genomeMatrices[["cg_promoters"]][rowSums(!is.na(obj@genomeMatrices[["cg_promoters"]])) >= nrow(obj@metadata) * 0.10, ]

Now you can view average %mCG of marker gene promoters by cluster with functions like dotM. This plotting tool accepts two modalities - one will determine dot size and the other color. We will calculate z score for color.

obj@genomeMatrices[["cg_promoters_z"]] <- as.data.frame(scale(obj@genomeMatrices[["cg_promoters"]], center = TRUE, scale = TRUE))
genes <- c("SPI1", "CD2", "S100A8", "CD79A", "ELANE", "MPO", "MPEG1", "IRF8", "CD74", "CD3E", "CD3D", "KLRB1")
dotM(obj, genes = genes, groupBy = "louvain_vmr_cluster", sizeMatrix = "cg_promoters", colorMatrix = "cg_promoters_z")

Again, easily modify with ggplot logic as desired:

Note: In this small dataset, lack of dot means no value was collected

dotM(obj, genes = genes, groupBy = "louvain_vmr_cluster", sizeMatrix = "cg_promoters", colorMatrix = "cg_promoters_z", splitBy = "batch") + 
  scale_color_gradientn(colors =  c("#005eff", "grey60", "#ffa600")) + scale_size(range = c(1, 8))

It can also be helpful to use a less directed approach when determining differences between groups. For this vignette, we are just testing a subset of known marker genes, but for thorough data analysis it would better to test all protein coding genes.

cluster_promoter_markers <- findClusterMarkers(obj, 
                                               nmin = 5, # increase to at least 10 for larger datasets
                                               matrix = "cg_promoters", 
                                               genes = genes, # short subset for illustration purposes
                                               groupBy = "louvain_vmr_cluster",
                                               method = "BH", # recommended to use "bonferroni" for real data
                                               threads = 1)
cluster_promoter_markers <- cluster_promoter_markers |> dplyr::filter(p.adj < 0.1) # Hardly any results because of dataset size; lower p.adj threshold for larger datsets

When plotting the result, we can indeed see that the promoter is hypomethylated in group 1.

dotM(obj, genes = cluster_promoter_markers$gene, groupBy = "louvain_vmr_cluster", colorMatrix = "cg_promoters", sizeMatrix = "cg_promoters_z") + 
  scale_color_gradientn(colors = c("#005eff", "grey60", "#ffa600")) + scale_size(range = c(1, 12))

2. Compare to a reference if possible

Another method one could use is by comparison to an annotated reference. While few exist, we have put aggregated methylation levels per group over high-confidence PBMC VMRs on Github. First download this data and calculate average methylation levels per cluster for each DMR (windows or any feature can also work)

download.file("https://raw.githubusercontent.com/lrylaarsdam/amethyst/main/vignettes/pbmc_vignette/pbmc_ref.RData", "~/Downloads/pbmc_ref.RData")
ref <- readRDS("~/Downloads/pbmc_ref.RData")
obj@genomeMatrices[["pbmc_vmrs_aggregated"]] <- aggregateMatrix(obj, matrix = "pbmc_vmrs", groupBy = "louvain_vmr_cluster", nmin = 2) # raise nmin for real data

Now view correlation profiles

cor <- cor(merge(ref, 
                 obj@genomeMatrices[["pbmc_vmrs_aggregated"]], 
                 by = 0) |> tibble::column_to_rownames(var = "Row.names"), use = "pairwise.complete.obs")
cor <- cor[c(1:ncol(ref)), c((ncol(ref) + 1)):ncol(cor)]
pheatmap(cor)

Based on all these annotation tools, we can rename our clusters according to broad class using dplyr logic:

obj@metadata[["type"]] <- dplyr::recode(obj@metadata[["louvain_vmr_cluster"]],
                                             "1" = "T", 
                                             "2" = "NK", 
                                             "3" = "B",
                                             "4" = "Mono")

group_labels <- merge(obj@metadata, obj@reductions[["umap_irlba_cg_vmr_pct"]], by = 0) |> dplyr::group_by(type) |> dplyr::summarise(dim_x = mean(dim_x), dim_y = mean(dim_y))
  
library(ggrepel)
dimFeature(obj, colorBy = type, colors = pal, pointSize = 0.8, reduction = "umap_irlba_cg_vmr_pct") +
  geom_text_repel(data = group_labels, aes(x = dim_x, y = dim_y, label = type))

You might also want cluster tracks with the group name. Recalculate, or just copy our previous one:

obj@tracks[["cg_type_tracks"]] <- data.table::copy(obj@tracks[["cg_cluster_tracks"]])
data.table::setnames(obj@tracks[["cg_type_tracks"]], c("chr", "start", "end", "T", "NK", "B", "Mono"))

Now heatMaps will be labeled with the corresponding population:

heatMap(obj, 
        genes = c("CD3D", "KIR2DL4", "MPO", "MPEG1"), 
        track = "cg_type_tracks", 
        nrow = 2,
        legend = T)

DIFFERENTIALLY METHYLATED REGION (DMR) ANALYSIS

There are two main formats to set up DMR analysis. The first is to test DMRs for each cluster against all others. Only the sum matrix (which we saved at the calcSmoothedWindows step) is needed, or regenerate with your annotated cell types:

dmrs <- testDMR(cluster1kbwindows[["sum_matrix"]], eachVsAll = TRUE, nminTotal = 5, nminGroup = 5) 
## Finished group 1
## Finished group 2
## Finished group 3
## Finished group 4

Then expand and filter the resulting list according to the desired stringency.

dmrs <- filterDMR(dmrs, method = "bonferroni", filter = TRUE, pThreshold = 0.01, logThreshold = 1.5)
head(dmrs)
##       chr   start     end   test         pval   logFC         padj direction
##    <char>   <num>   <num> <fctr>        <num>   <num>        <num>    <char>
## 1:   chr1  868000  869000      1 3.782380e-16  3.2990 2.201126e-09     hyper
## 2:   chr1  906000  907000      1 1.489341e-15  3.3188 8.667102e-09     hyper
## 3:   chr1 1347000 1348000      1 7.903053e-19  2.8856 4.599118e-12     hyper
## 4:   chr1 1426000 1427000      1 6.655007e-11 -1.9700 3.872828e-04      hypo
## 5:   chr1 1512000 1513000      1 2.128572e-11  4.2218 1.238705e-04     hyper
## 6:   chr1 1513000 1514000      1 3.860030e-12  2.0069 2.246313e-05     hyper

Especially since the matrix is smoothed, adjacent genomic windows may be significant. You can collapse them with the following function. If annotation = T, any overlapping genes will be noted in the results table.

Note: Please pay attention to the maxDist and minLength parameters and decide if these are appropriate for your data. Here, we are combining any DMRs within 2kb of each other, and applying a minimum threshold of 2kb to count as a DMR.

collapsed_dmrs <- collapseDMR(obj, dmrs, maxDist = 2000, minLength = 2000, reduce = T, annotate = T) 
head(collapsed_dmrs)
## Key: <chr, dmr_start, dmr_end>
##       chr dmr_start dmr_end   test direction dmr_length     dmr_padj dmr_logFC
##    <char>     <num>   <num> <fctr>    <char>      <num>        <num>     <num>
## 1:   chr1   1198000 1200000      3      hypo       2000 1.904499e-04 -1.864500
## 2:   chr1   1329000 1331000      3      hypo       2000 4.978918e-03      -Inf
## 3:   chr1   1512000 1514000      1     hyper       2000 7.316683e-05  3.114350
## 4:   chr1   1736000 1739000      3      hypo       3000 2.690871e-07 -2.597633
## 5:   chr1   2229000 2233000      1      hypo       4000 4.779676e-10 -2.612975
## 6:   chr1   2592000 2594000      2      hypo       2000 4.824348e-08 -2.119050
##                                                     gene_names
##                                                         <char>
## 1:                                                          NA
## 2:                                                          NA
## 3:                                                      ATAD3A
## 4: ENSG00000268575, ENSG00000227775, ENSG00000290854, SLC35E2A
## 5:                                                         SKI
## 6:                                                       MMEL1

The “test” column indicates which cluster is considered the member group. If you are testing a renamed matrix, you might want to add those names To your results instead of having the numerical order in which they were tested.

key <- data.frame(test = as.factor(1:4), 
                type = names(obj@tracks[["cg_type_tracks"]])[4:ncol(obj@tracks[["cg_type_tracks"]])]) # or just specify manually, but that can get tedious
collapsed_dmrs <- left_join(collapsed_dmrs, key, by = "test")
head(collapsed_dmrs)
## Key: <chr, dmr_start, dmr_end>
##       chr dmr_start dmr_end   test direction dmr_length     dmr_padj dmr_logFC
##    <char>     <num>   <num> <fctr>    <char>      <num>        <num>     <num>
## 1:   chr1   1198000 1200000      3      hypo       2000 1.904499e-04 -1.864500
## 2:   chr1   1329000 1331000      3      hypo       2000 4.978918e-03      -Inf
## 3:   chr1   1512000 1514000      1     hyper       2000 7.316683e-05  3.114350
## 4:   chr1   1736000 1739000      3      hypo       3000 2.690871e-07 -2.597633
## 5:   chr1   2229000 2233000      1      hypo       4000 4.779676e-10 -2.612975
## 6:   chr1   2592000 2594000      2      hypo       2000 4.824348e-08 -2.119050
##                                                     gene_names   type
##                                                         <char> <char>
## 1:                                                          NA      B
## 2:                                                          NA      B
## 3:                                                      ATAD3A      T
## 4: ENSG00000268575, ENSG00000227775, ENSG00000290854, SLC35E2A      B
## 5:                                                         SKI      T
## 6:                                                       MMEL1     NK

If specific comparisons are desired, a data frame can be provided describing the tests. Three columns should be included: One listing members of group A, one listing members of group B, and one with the name of the test.

comparisons <- data.frame(
  stringsAsFactors = FALSE,
              name = c("test1", "test2", "test3"),
                 A = c("1,2,3", "1", "2,3"),
                 B = c("1,4", "2", "1")
)
example_dmrs <- testDMR(sumMatrix = cluster1kbwindows[["sum_matrix"]], comparisons = comparisons, nminTotal = 5, nminGroup = 5)

If desired, any results can be stored in the results slot of the Amethyst object. This helps organize all analysis in one structure.

obj@results[["cg_collapsed_dmrs"]] <- collapsed_dmrs

EXPLORING DMR RESULTS

First, let’s look at how many DMRs were identified in each group:

ggplot(collapsed_dmrs |> dplyr::group_by(type, direction) |> dplyr::summarise(n = n()), 
       aes(y = type, x = n, fill = type)) + geom_col() + 
  facet_grid(vars(direction), scales = "free_y") + scale_fill_manual(values = pal) + theme_classic()

Isolate top results per group We find it helpful to select by a combined metric of logFC and padj, but you can modify as necessary:

top_dmrs <- collapsed_dmrs |> 
  dplyr::group_by(type, direction) |> 
  dplyr::arrange(dmr_padj, .by_group = TRUE) |> dplyr::mutate(rank_padj = 1:n()) |>
  dplyr::arrange(desc(abs(dmr_logFC)), .by_group = TRUE) |> dplyr::mutate(rank_logFC = 1:n()) |>
  rowwise() |> dplyr::mutate(total_rank = sum(rank_padj, rank_logFC)) |> 
  group_by(test, direction) |> slice_min(n = 1, order_by = total_rank) |>
  dplyr::mutate(location = paste0(chr, "_", (dmr_start - 2000), "_", (dmr_end + 2000))) |> dplyr::arrange(direction)

Plotting top hypomethylated regions for each group shows expected patterns based on known marker genes:

heatMap(obj, 
        track = "cg_type_tracks", 
        regions = top_dmrs$location[top_dmrs$direction == "hypo"], 
        nrow = 2, 
        legend = F)

You can also turn the DMRs into tracks to plot the exact locations below:

dmr_tracks <- split(collapsed_dmrs |> dplyr::filter(direction == "hypo") |> ungroup() |> dplyr::select(chr, dmr_start, dmr_end, type) |> dplyr::rename(start = dmr_start, end = dmr_end) |> data.table(), collapsed_dmrs[collapsed_dmrs$direction == "hypo"]$type)

heatMap(obj, 
        track = "cg_type_tracks", 
        regions = "chr14_99317000_99324000",
        extraTracks = rev(dmr_tracks),
        extraTrackColors = rev(pal))

Gene ontology (GO) analysis

Further interpretation of the results can be explored using a wide variety of packages available on R. In this example, we will use the topGO package to test for Gene Ontology (GO) term enrichments for genes with hypomethylated regions in the T cell group.

library(topGO)
background <- rownames(obj@genomeMatrices[["cg_promoters"]]) # all genes tested 
query <- unlist(strsplit(collapsed_dmrs$gene_names[collapsed_dmrs$type == "T" & collapsed_dmrs$direction == "hypo"], ", "))

GOdata <- new("topGOdata", 
              description = "GO Enrichment Analysis", 
              ontology = "BP", 
              allGenes = setNames(factor(as.integer(background %in% query), levels = c(0, 1)), background),
              geneSel = function(x) x == 1, 
              nodeSize = 10, 
              annot = annFUN.org, 
              mapping = "org.Hs.eg.db", 
              ID = "symbol")
resultElim <- runTest(GOdata, algorithm = "elim", statistic = "fisher")
resultElim <- GenTable(GOdata, Fisher = resultElim, topNodes = 500, numChar = 60)
resultElim <- resultElim |> dplyr::filter(Fisher < 0.01 & Significant > 5) |> dplyr::mutate(fold_change = Significant/Expected, Fisher = as.numeric(Fisher))
resultElim <- janitor::clean_names(resultElim)

As expected, top results are strongly related to T cell processes:

ggplot(resultElim, aes(x = fold_change, y = reorder(term, fold_change), fill = fisher)) + geom_col() + theme_classic() + scale_fill_viridis_c(direction = -1)

Finally, save your workspace so you don’t have to re-calculate everything in the future.

save.image("~/Downloads/pbmc_vignette_workspace.RData")

CONCLUSION

Thanks for trying out Amethyst! Additional utilities are still to come. We are also open to suggestions. If you use this package and/or code on our Github for your work, please cite our manuscript.

Good luck in your analysis!